
    BISER: Fast Characterization of Segmental Duplication Structure in Multiple Genome Assemblies

    The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity, which stimulates further evolution by changing the genomic structure. However, optimal computation of SDs through standard local alignment algorithms is impractical due to the size of most genomes. A cross-genome evolutionary analysis of SDs is even harder, as one needs to characterize SDs in multiple genomes and find relations between those SDs and unique segments in other genomes. Thus, there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Here we introduce a new tool, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology (75%) to multiple genomes while introducing further 8-24× speed-ups over existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications as far back as 90 million years.
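    The abstract notes that running standard local alignment over whole genomes is impractical; a common way around this in the family of techniques BISER belongs to is k-mer seeding. As a rough illustration of the seeding idea only, not BISER's actual algorithm, the Python toy below flags well-separated position pairs sharing an exact k-mer as candidate SD seeds. The function name and parameters are invented for this sketch.

```python
from collections import defaultdict

def sd_seed_hits(genome, k=16, min_distance=1000):
    """Toy seeding pass: a k-mer occurring at two well-separated positions
    seeds a candidate segmental duplication. Real tools (BISER included)
    chain and locally extend such seeds instead of aligning the whole
    genome against itself quadratically."""
    occurrences = defaultdict(list)
    for i in range(len(genome) - k + 1):
        occurrences[genome[i:i + k]].append(i)
    seeds = []
    for positions in occurrences.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                if positions[b] - positions[a] >= min_distance:
                    seeds.append((positions[a], positions[b]))
    return seeds
```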

    Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery

    Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association with human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a class of SVs very important to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both the loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with a precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted VariationHunter-CR) outperforms state-of-the-art algorithms (such as the original VariationHunter, BreakDancer and MoDIL) when tested on the genome of a Yoruba African individual (NA18507).
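    The paired-end signal behind transposon insertion discovery is a read pair whose one mate anchors to the reference while the other matches a transposon consensus; clustering such anchors by locus and element class yields candidate insertions. The Python sketch below illustrates that clustering step only. It is not VariationHunter's combinatorial (maximum-parsimony) formulation, and all names and thresholds are invented for the example.

```python
from collections import defaultdict

def cluster_insertions(anchored_pairs, max_gap=500, min_support=2):
    """anchored_pairs: (chrom, anchor_pos, te_class) tuples, one per
    discordant pair whose unanchored mate hit a transposon consensus.
    Nearby anchors of the same element class merge into one candidate
    insertion locus."""
    by_key = defaultdict(list)
    for chrom, pos, te in anchored_pairs:
        by_key[(chrom, te)].append(pos)
    calls = []
    for (chrom, te), positions in by_key.items():
        positions.sort()
        clusters = [[positions[0]]]
        for pos in positions[1:]:
            if pos - clusters[-1][-1] <= max_gap:
                clusters[-1].append(pos)
            else:
                clusters.append([pos])
        for cluster in clusters:
            if len(cluster) >= min_support:
                calls.append((chrom, te, sum(cluster) // len(cluster), len(cluster)))
    return calls
```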

    Robustness of Massively Parallel Sequencing Platforms

    The improvements in high-throughput sequencing (HTS) technologies made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although there are significant improvements in the accuracy and reproducibility of HTS-based analyses, the usability of these types of data for diagnostic and prognostic applications necessitates near-perfect data generation. To assess the robustness of a widely used HTS platform for accurate and reproducible clinical applications, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed a significant number of discrepancies between the call sets. As expected, most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, although only a small fraction of the discordant variants fell within exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful to provide data for first-pass clinical tests, variant predictions still need to be confirmed using orthogonal methods before use in clinical applications.
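    The comparison described here reduces to set operations over normalized variant calls. As a minimal sketch, assuming calls have already been parsed out of the two centers' VCFs into (chrom, pos, ref, alt) tuples, the helper below computes shared and center-private calls; it is illustrative only, not the paper's actual pipeline.

```python
def call_set_concordance(calls_a, calls_b):
    """calls_a, calls_b: iterables of (chrom, pos, ref, alt) tuples from
    the two sequencing centers. Returns shared calls plus each center's
    private calls -- the 'discrepancies' discussed above -- and the
    Jaccard-style concordance rate over the union."""
    a, b = set(calls_a), set(calls_b)
    shared, only_a, only_b = a & b, a - b, b - a
    rate = len(shared) / len(a | b) if a | b else 1.0
    return shared, only_a, only_b, rate
```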

    Structural variation and fusion detection using targeted sequencing data from circulating cell free DNA

    MOTIVATION: Cancer is a complex disease that involves rapidly evolving cells, often forming multiple distinct clones. In order to effectively understand the progression of a patient-specific tumor, one needs to comprehensively sample tumor DNA at multiple time points, ideally through inexpensive and minimally invasive techniques. Current sequencing technologies make the 'liquid biopsy' possible, which involves sampling a patient's blood or urine and sequencing the circulating cell free DNA (cfDNA). A certain percentage of this DNA originates from the tumor and is known as circulating tumor DNA (ctDNA). The ratio of ctDNA in the sample may be extremely low, and the ctDNA may originate from multiple tumors or clones. These factors present unique challenges for applying existing tools and workflows to the analysis of ctDNA, especially for the detection of structural variations, which requires sufficient read coverage. RESULTS: Here we introduce SViCT, a structural variation (SV) detection tool designed to handle the challenges associated with cfDNA analysis. SViCT can detect breakpoints and sequences of various structural variations including deletions, insertions, inversions, duplications and translocations. SViCT extracts discordant read pairs, one-end anchors and soft-clipped/split reads, assembles them into contigs, and re-maps contig intervals to a reference genome using an efficient k-mer indexing approach. The intervals are then joined using a combination of graph and greedy algorithms to identify specific structural variant signatures. We assessed the performance of SViCT and compared it to state-of-the-art tools using simulated cfDNA datasets with properties matching those of real cfDNA samples. The positive predictive value and sensitivity of our tool were superior to all tested tools, and reasonable performance was maintained down to the lowest dilution of 0.01% tumor DNA in the simulated datasets. Additionally, SViCT was able to detect all known SVs in two real cfDNA reference datasets (at 0.6-5% ctDNA) and predicted a novel structural variant in a prostate cancer cohort.
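    The re-mapping step the abstract mentions, anchoring assembled contigs back to the reference through a k-mer index, can be sketched compactly. The toy below is not SViCT's implementation (names and the k value are invented for illustration): contig positions sharing a k-mer with the reference are grouped by diagonal, and the gaps between the resulting mapped intervals are the breakpoint candidates.

```python
from collections import defaultdict

def build_kmer_index(reference, k=21):
    """Hash every reference k-mer to its positions (a toy index; real
    tools use far more compact structures)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_contig(contig, index, k=21):
    """Group exact k-mer hits by diagonal (ref_pos - contig_pos); each
    diagonal's span is one mapped interval, reported as
    (contig_start, contig_end, ref_start, ref_end). Breaks between
    intervals mark candidate SV breakpoints."""
    diagonals = defaultdict(list)
    for i in range(len(contig) - k + 1):
        for ref_pos in index.get(contig[i:i + k], []):
            diagonals[ref_pos - i].append(i)
    intervals = []
    for diag, starts in diagonals.items():
        lo, hi = min(starts), max(starts) + k
        intervals.append((lo, hi, diag + lo, diag + hi))
    return sorted(intervals)
```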

    Scalable mapping and compression of high throughput genome sequencing data

    High-throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing, downstream analysis and computational infrastructure. HTS has become an invaluable technology for many applications, e.g., the detection of single-nucleotide polymorphisms and structural variations. In most of these applications, mapping the sequenced "reads" to their potential genomic origin is the first fundamental step for subsequent analyses. Many tools have been developed to address this problem, and because of the sheer volume of HTS data, much emphasis has been placed on speed and memory. In fact, as HTS data grow in size, data management and storage are becoming major logistical obstacles for adopting HTS platforms. The requirement for ever-increasing monetary investment almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information, which holds most of the sequence data generated worldwide. One way to address the storage requirements for HTS data is compression. Currently, most HTS data is compressed through general-purpose algorithms such as gzip, which are not specifically designed for data generated by HTS platforms. Recently, a number of fast and efficient compression algorithms have been designed specifically for HTS data to address some of the issues in data management, storage and communication. In this thesis, we study both of these computational problems, i.e., sequence mapping and sequence compression, extensively. We introduce two novel methods, mrsFAST and drFAST, to map HTS short reads to the reference genome. These methods are cache-oblivious and guarantee perfect sensitivity, and both are specifically designed to address the bottleneck of multi-mapping for the purpose of structural variation detection. In addition, we present Dissect for mapping the whole transcriptome to the genome while considering structural alterations in the transcriptome; Dissect is designed specifically to map HTS long reads as well as assembled contigs. Finally, we address the storage and communication problems in HTS data by introducing SCALCE, a "boosting" scheme based on the Locally Consistent Parsing technique. SCALCE re-orders the data to increase the locality of reference and subsequently improve the performance of well-known compression methods in terms of both speed and space.
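    The SCALCE idea the abstract closes with, reordering reads so that similar sequences become neighbors before a generic compressor runs, can be mimicked in a few lines. The sketch below substitutes a simple minimizer for SCALCE's Locally Consistent Parsing cores, so it is an analogy rather than the published scheme; names and the k value are invented.

```python
import gzip

def core(read, k=8):
    """Stand-in for an LCP core: the lexicographically smallest k-mer of
    the read. Reads sharing a long substring tend to share this key."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def boosted_compress(reads):
    """Reorder reads by their core so near-identical sequences sit next
    to each other, then hand the stream to a general-purpose compressor;
    the gain over compressing the unsorted stream is the 'boost'."""
    ordered = sorted(reads, key=core)
    return gzip.compress("\n".join(ordered).encode())
```

    Note that on real data the permutation must be recorded (or read names preserved) so the original order can be restored on decompression; SCALCE handles this, while the toy above does not.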

    Solving NP search problems with model expansion

    We explore the application of MXG, a declarative programming solver for NP search problems based on Model Expansion (MX) for first-order logic with inductive definitions. We present specifications for several common NP-complete benchmark problems in the language of MXG, and describe some modeling techniques we found useful in obtaining good solver performance. We present an experimental comparison of the performance of MXG with Answer Set Programming (ASP) solvers on these problems, showing that MXG is competitive and often better. As an extended example, we consider an NP-complete phylogenetic inference problem. We present several specifications for this problem, employing a variety of techniques for obtaining good performance. Our best solution, which combines instance pre-processing, redundant axioms, and symmetry-breaking axioms, performs orders of magnitude faster than previously reported declarative programming solutions using ASP solvers.
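    MXG specifications are written in first-order logic with inductive definitions and cannot be reproduced here, but one technique the abstract credits for its best solution, symmetry breaking, is language-independent. The toy Python search below (invented for illustration) enumerates graph colorings and shows the idea: pinning the first vertex's color discards assignments that differ only by renaming colors.

```python
from itertools import product

def colorings(n_vertices, edges, n_colors, break_symmetry=True):
    """Brute-force proper k-colorings of a graph. With break_symmetry,
    vertex 0 is pinned to color 0, pruning solutions equivalent up to a
    permutation of colors -- the same effect symmetry-breaking axioms
    have in a declarative specification."""
    first = [0] if break_symmetry else range(n_colors)
    rest = [range(n_colors)] * (n_vertices - 1)
    for assignment in product(first, *rest):
        if all(assignment[u] != assignment[v] for u, v in edges):
            yield assignment
```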

    Genion, an accurate tool to detect gene fusion from long transcriptomics reads

    Background: The advent of next-generation sequencing technologies has empowered a wide variety of transcriptomics studies. A widely studied topic is gene fusion, which is observed in many cancer types and suspected of having oncogenic properties. Gene fusions are the result of structural genomic events that bring two genes into close proximity and result in a fused transcript. This is different from fusion transcripts created during or after the transcription process; such chimeric transcripts are also known as read-through and trans-splicing transcripts. Gene fusion discovery with short reads is a well-studied problem for which many methods have been developed, but the sensitivity of these methods is limited by the technology, especially the short read length. Advances in long-read sequencing technologies allow the generation of long transcriptomics reads at a low cost. Transcriptomic long-read sequencing presents unique opportunities to overcome the shortcomings of short-read technologies for gene fusion detection while introducing new challenges. Results: We present Genion, a sensitive and fast gene fusion detection method that can also detect read-through events. We compare Genion against a recently introduced long-read gene fusion discovery method, LongGF, on both simulated and real datasets. On simulated data, Genion accurately identifies the gene fusions, and its clustering accuracy for detecting fusion reads is better than LongGF's. Furthermore, our results on the breast cancer cell line MCF-7 show that Genion correctly identifies all the experimentally validated gene fusions. Conclusions: Genion is an accurate gene fusion caller. Genion is implemented in C++ and is available at https://github.com/vpc-ccg/genion.
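    At its core, long-read fusion calling aggregates per-read evidence: a read whose aligned segments touch two different genes votes for that gene pair. The sketch below shows only this aggregation idea and is not Genion's algorithm (which also distinguishes read-throughs and applies further filtering); the input format and threshold are invented for the example.

```python
from collections import Counter

def call_fusion_candidates(read_segments, min_support=3):
    """read_segments: {read_id: [gene per aligned segment, in read order]}.
    Adjacent segments from two distinct genes vote for that gene pair;
    pairs with enough supporting reads become fusion candidates. A real
    caller would additionally separate genomic fusions from read-through
    transcripts of neighboring genes."""
    votes = Counter()
    for genes in read_segments.values():
        collapsed = [g for i, g in enumerate(genes)
                     if i == 0 or g != genes[i - 1]]
        for g1, g2 in zip(collapsed, collapsed[1:]):
            votes[tuple(sorted((g1, g2)))] += 1
    return {pair: n for pair, n in votes.items() if n >= min_support}
```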

    Fast characterization of segmental duplication structure in multiple genome assemblies

    Motivation: The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity, which stimulates further evolution by changing the genomic structure and inventing new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms that are impractical due to the size of most genomes. Additionally, to perform evolutionary analysis, one needs to characterize SDs in multiple genomes and find relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach based on multiple sequence alignment would make the optimal solution to this problem even more impractical. Thus, there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Results: Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7-33× speed-ups over existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications as far back as 300 million years. Availability and implementation: BISER is implemented in the Seq programming language and is publicly available at https://github.com/0xTCG/biser.
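    The elementary-SD decomposition mentioned in the results can be pictured as interval atomization: cutting every SD at every copy boundary leaves minimal pieces from which each SD is a concatenation. The Python toy below shows that picture for intervals on a single sequence; it is a conceptual aid with invented names, not BISER's actual decomposition, which also tracks the pairwise alignments between copies.

```python
def elementary_segments(sd_intervals):
    """sd_intervals: (start, end) coordinates of SDs on one sequence.
    Splitting at every interval boundary yields the atomic pieces; keep
    only the pieces that lie inside at least one SD."""
    cuts = sorted({p for s, e in sd_intervals for p in (s, e)})
    pieces = zip(cuts, cuts[1:])
    return [(s, e) for s, e in pieces
            if any(a <= s and e <= b for a, b in sd_intervals)]
```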